In this lab, as in the next two, we will explore different classification methods and how we can compare their results. For these three labs, we will use the MNIST database of hand-written digits. In this lab, we will use decision trees and random forest methods. In Lab 6, we will use neural networks, and in Lab 7, Support Vector Machines.
For this lab and the next two, students must write a report (a single report covering all three labs) which will be used during the oral exam. This report should highlight the different methods used during the labs, as well as how you validated each method and compared their results.
The dataset is the MNIST database of handwritten digits, from LeCun et al.: http://yann.lecun.com/exdb/mnist/ It contains a training set of 55,000 examples and a test set of 10,000 examples.
The dataset is split into a training set (for learning and cross-validation) and a test set (for evaluation of the model). Each 28x28-pixel image is flattened into a 784-length vector ('images') and the correct label is one-hot encoded in a 10-length vector ('labels'). The following piece of code shows a sample of the training set with the correct labels.
%matplotlib inline
from matplotlib import pyplot as plt
from MNISTData import MNISTData
from sklearn.metrics import confusion_matrix
import numpy as np

mnist = MNISTData(train_dir='MNIST_data', one_hot=True)

plt.gray()
for i in range(9):
    plt.subplot(3, 3, i + 1)
    plt.imshow(mnist.train['images'][i].reshape((28, 28)))
    plt.title(mnist.train['labels'][i].argmax())
    plt.axis('off')
plt.show()
def evaluate_classifier(clf, test_data, test_labels):
    """Return (accuracy in %, confusion matrix) of clf on the test set."""
    pred = clf.predict(test_data)
    # scikit-learn convention: confusion_matrix(y_true, y_pred)
    C = confusion_matrix(test_labels.argmax(axis=1), pred.argmax(axis=1))
    return C.diagonal().sum() * 100. / C.sum(), C
Training a classifier on the entire MNIST dataset takes a long time. In order to be able to test different methods within the time of a laboratory session, we'll only use 1/10th of the dataset.
Create a simple Decision Tree classifier using scikit-learn and train it on the MNIST dataset. Use it to predict the classes of the test dataset.
Evaluate the performance of the classifier on the test dataset.
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
# Use StratifiedKFold to select 1/10th of the dataset while preserving the class distribution.
# (sklearn.cross_validation is deprecated; the class now lives in sklearn.model_selection.)
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=10)
labels = mnist.train['labels'].argmax(axis=1)
# Take the first fold only: 'small' holds 1/10th of the indices, 'big' the remaining 9/10ths.
big, small = next(skf.split(mnist.train['images'], labels))
data_x = mnist.train['images'][small]
data_y = mnist.train['labels'][small]
from sklearn import tree
clf = tree.DecisionTreeClassifier()
# --- your code here --- #
s,C = evaluate_classifier(clf, mnist.test['images'], mnist.test['labels'])
print(s)
print(C)
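One possible completion of the placeholder above is simply to fit the tree on the 1/10th training subset (`clf.fit(data_x, data_y)`). As a self-contained sketch, the same steps are shown here on scikit-learn's built-in `load_digits` dataset (8x8 images), used only as a small stand-in for MNIST; the accuracy you obtain on MNIST will differ.

```python
# Sketch of the missing training step, using load_digits as a small
# stand-in for MNIST (assumption: the lab's mnist object is not available here).
from sklearn import tree
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0,
    stratify=digits.target)

clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)  # the "your code here" step: fit on the training subset

# Accuracy computed as in evaluate_classifier: diagonal of the confusion matrix.
pred = clf.predict(X_test)
C = confusion_matrix(y_test, pred)
accuracy = C.diagonal().sum() * 100. / C.sum()
print(accuracy)
```

In the lab itself, the call would be `clf.fit(data_x, data_y)`: since the labels are one-hot encoded, `clf.predict` then returns a 2D array, which is what `evaluate_classifier` expects.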
How could you improve the results of this classifier? What parameters can you change? Can you pre-process the data? Try to improve the results using cross-validation.
Evaluate your best classifier on the test set. How can you compare it to the classifier with default parameters?
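One standard way to tune tree parameters with cross-validation is `GridSearchCV`. The sketch below again uses `load_digits` as a stand-in for MNIST, and the parameter ranges (`max_depth`, `min_samples_split`) are illustrative choices, not a recommended grid.

```python
# Hypothetical parameter search for a decision tree via cross-validation.
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0,
    stratify=digits.target)

# Illustrative grid: tree depth and minimum split size control over-fitting.
param_grid = {
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 8, 32],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)            # mean cross-validated accuracy of the best model
print(search.score(X_test, y_test))  # held-out test accuracy of the refitted best model
```

Comparing `search.score(X_test, y_test)` with the test accuracy of the default-parameter tree gives a fair comparison, since both are evaluated on the same held-out test set that was never used for tuning.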
# --- your code here --- #
Random Forest classifiers train multiple decision trees on sub-samples of the dataset and average their predictions, so as to reduce over-fitting.
Use scikit-learn to create Random Forest classifiers on the MNIST data. Use cross-validation to test different parameters, and evaluate your best classifier on the test set. Compare the results with the previous classifiers.
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
from sklearn import ensemble
clf = ensemble.RandomForestClassifier()
# --- your code here --- #
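A possible shape for this exercise, sketched on `load_digits` as a stand-in for MNIST: combine `RandomForestClassifier` with a small `GridSearchCV`. The grid values (`n_estimators`, `max_features`) are illustrative assumptions, not tuned recommendations.

```python
# Hypothetical random-forest parameter search (load_digits stands in for MNIST).
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0,
    stratify=digits.target)

# Illustrative grid: number of trees and features considered per split.
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_features': ['sqrt', None],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # held-out accuracy of the best forest
```

On this small dataset the forest typically outperforms a single tree noticeably; whether the same gap appears on MNIST is exactly what the comparison in your report should establish.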